
Bias-Complexity Trade-off

Error Decomposition

In its simplest form, the error of a hypothesis $h_S$ returned by ERM over a hypothesis class $H$ decomposes as:

\begin{gather*} L_D(h_S) = \epsilon_{app} + \epsilon_{est}, \quad \text{where } \epsilon_{app} = \min_{h\in H} L_D(h) \text{ and } \epsilon_{est} = L_D(h_S) - \epsilon_{app}. \end{gather*}

The approximation error $\epsilon_{app}$ is the lowest error achievable by any hypothesis in $H$; the estimation error $\epsilon_{est}$ is the excess error incurred by learning from a finite sample.

Enlarging the hypothesis class decreases the approximation error but may increase the estimation error, since a rich $H$ can lead to overfitting. On the other hand, choosing $H$ to be a very small set reduces the estimation error but may increase the approximation error or, in other words, may lead to underfitting.
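A quick numerical illustration of this trade-off (a sketch under my own assumptions: a noisy sinusoidal target, a training set of 20 points, and nested classes $H_d$ of polynomials of degree at most $d$; none of these specifics come from the text). It runs least-squares ERM over each $H_d$ and compares training and test error.

# Bias-complexity trade-off sketch: ERM over nested polynomial classes H_d.
# All modelling choices (target function, noise level, sample sizes, degrees)
# are illustrative assumptions, not taken from the text.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Draw n labelled points from a fixed distribution D: x uniform on [-1, 1],
    # y = sin(3x) plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(20)       # small training set S
x_test, y_test = sample(10_000)     # large held-out set, approximating L_D

for degree in [0, 1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares ERM over H_degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train {train_err:.3f}, test {test_err:.3f}")

# Typically, low degrees leave a large gap to the best achievable error
# (approximation error / underfitting), while high degrees drive the training
# error down but let the test error grow (estimation error / overfitting).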

The No-Free-Lunch Theorem

Let $A$ be any learning algorithm for the task of binary classification over a domain $X$, and let $m \le |X|/2$ be a training-set size. Then there exists a distribution $D$ over $X \times \{0, 1\}$ such that:

  1. There exists a function $f: X \to \{0, 1\}$ with $L_D(f) = 0$.
  2. With probability of at least $1/7$ over the choice of $S \sim D^m$, we have $L_D(A(S)) \ge 1/8$.
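Before the proof, here is a small simulation sketch of the construction used below (purely illustrative; the choice $m = 3$ and the memorise-then-predict-0 learner are my own assumptions, not part of the theorem). It enumerates every labelling of a set $C$ with $|C| = 2m$ and every length-$m$ training sequence, and checks that the average over labellings of $\mathbb{E}_S[L_{D_i}(A(S))]$, and hence the maximum, is at least $1/4$.

# No-Free-Lunch sketch: exhaustive check of the construction for a tiny m.
# The specific learner (memorise the sample, predict 0 elsewhere) is an
# arbitrary illustrative choice; the bound holds for any fixed learner.
import itertools
import numpy as np

m = 3
C = list(range(2 * m))                 # domain subset C with |C| = 2m

def learner(sample):
    # Memorise training labels; predict 0 on any point not seen in training.
    memory = {x: y for x, y in sample}
    return lambda x: memory.get(x, 0)

expected_errors = []                   # E_{S ~ D_i^m}[L_{D_i}(A(S))], one per labelling f_i
for labels in itertools.product([0, 1], repeat=len(C)):    # T = 2^(2m) labellings
    f = dict(zip(C, labels))
    losses = []
    for xs in itertools.product(C, repeat=m):              # k = (2m)^m sequences, equally likely
        h = learner([(x, f[x]) for x in xs])
        losses.append(np.mean([h(x) != f[x] for x in C]))  # L_{D_i}(h), uniform over C
    expected_errors.append(np.mean(losses))

print("average over i:", np.mean(expected_errors))   # comes out >= 0.25
print("maximum over i:", max(expected_errors))       # hence also >= 0.25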

Proof

Let $C \subseteq X$ be a subset of the domain of size $|C| = 2m$.
Next, consider all $T = 2^{2m}$ possible labelling functions $f_1, \dots, f_T$ from $C$ to $\{0, 1\}$.
For each $f_i$, define a distribution $D_i$ over $C \times \{0, 1\}$ that is uniform over $C$ with labels given by $f_i$, i.e., $D_i(\{(x, y)\}) = 1/|C|$ if $y = f_i(x)$ and $0$ otherwise. Denote by $D_i^m$ the distribution of $m$ i.i.d. samples from $D_i$.
Clearly, $L_{D_i}(f_i) = 0$.
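As a tiny concrete instance of this construction (my own illustration, not part of the original proof): for $m = 1$ we have $C = \{x_1, x_2\}$ and $T = 2^2 = 4$ labelling functions,

\begin{gather*} (f_1(x_1), f_1(x_2)) = (0,0), \quad (f_2(x_1), f_2(x_2)) = (0,1), \quad (f_3(x_1), f_3(x_2)) = (1,0), \quad (f_4(x_1), f_4(x_2)) = (1,1), \end{gather*}

and each $D_i$ places probability $1/2$ on $(x_1, f_i(x_1))$ and $1/2$ on $(x_2, f_i(x_2))$.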
Next, we want to show:

\begin{gather*} \max_{i \in [T]} \mathbb{E}_{S \sim D_i^m}[L_{D_i}(A(S))] \ge 1/4. \end{gather*}

Side track

Show that this inequality implies that there exists $i \in [T]$ such that, with probability at least $1/7$ over the training set $S \sim D_i^m$, we have $L_{D_i}(A(S)) \ge 1/8$.

Proof (using Markov's inequality)

For a random variable $X$ taking values in $[0, 1]$ and any $a \in (0, 1)$,

\begin{gather*} P(X \ge a) \ge {\mathbb{E}[X] - a \over 1 - a}, \end{gather*}

with $X = L_{D_i}(A(S))$ and $a = 1/8$. Since $\mathbb{E}[X] \ge 1/4$ for the maximizing $i$, this gives $P\big(L_{D_i}(A(S)) \ge 1/8\big) \ge {1/4 - 1/8 \over 1 - 1/8} = {1 \over 7}$.
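For completeness, one way to obtain the inequality above from Markov's inequality (a short derivation, relying only on $X$ taking values in $[0, 1]$) is to apply Markov to the non-negative variable $1 - X$:

\begin{gather*} P(X < a) = P(1 - X > 1 - a) \le {\mathbb{E}[1 - X] \over 1 - a} = {1 - \mathbb{E}[X] \over 1 - a}, \end{gather*}

so $P(X \ge a) \ge 1 - {1 - \mathbb{E}[X] \over 1 - a} = {\mathbb{E}[X] - a \over 1 - a}$.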

Back on track. There are $k = (2m)^m$ possible sequences of $m$ examples from $C$; denote these sequences by $S_1, \dots, S_k$. Also, if $S_j = (x_1, \dots, x_m)$, denote by $S_j^i$ the sequence labelled by the function $f_i$, namely $S_j^i = ((x_1, f_i(x_1)), \dots, (x_m, f_i(x_m)))$. If the underlying distribution is $D_i$, then the possible training sets $A$ can receive are $S_1^i, \dots, S_k^i$, all with equal probability. Therefore,

\begin{gather*} \mathbb{E}_{S \sim D_i^m}[L_{D_i}(A(S))] = {1 \over k} \sum_{j=1}^k L_{D_i}(A(S_j^i)). \end{gather*}
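Continuing the $m = 1$ illustration from above (again my own example): there are $k = 2^1 = 2$ sequences, $S_1 = (x_1)$ and $S_2 = (x_2)$, each drawn with probability $1/2$ under $D_i^1$, so the expectation is simply

\begin{gather*} \mathbb{E}_{S \sim D_i^1}[L_{D_i}(A(S))] = {1 \over 2}\Big( L_{D_i}\big(A(S_1^i)\big) + L_{D_i}\big(A(S_2^i)\big) \Big), \quad \text{with } S_1^i = ((x_1, f_i(x_1))), \; S_2^i = ((x_2, f_i(x_2))). \end{gather*}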

Using the facts that the maximum is at least the average and the average is at least the minimum, we have

\begin{align*} \max_{i \in [T]} {1 \over k} \sum_{j=1}^k L_{D_i}(A(S_j^i)) & \ge {1 \over T}\sum_{i=1}^T {1 \over k} \sum_{j=1}^k L_{D_i}(A(S_j^i)) \\ & = {1 \over k} \sum_{j=1}^k {1 \over T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) \\ & \ge \min_{j \in [k]} {1 \over T}\sum_{i=1}^T L_{D_i}(A(S_j^i)). \end{align*}

Next, fix some $j \in [k]$ and write $S_j = (x_1, \dots, x_m)$. Let $v_1, \dots, v_p$ be the examples in $C$ that do not appear in $S_j$. Since $|C| = 2m$ and $S_j$ contains at most $m$ distinct instances, clearly $p \ge m$. Therefore, for every hypothesis $h: X \to \{0, 1\}$ we have

\begin{align*} L_{D_i}(h) & = {1 \over 2m}\sum_{x \in C} \mathbb{1} [h(x) \ne f_i(x)] \\ & \ge {1 \over 2m}\sum_{r=1}^p \mathbb{1} [h(v_r) \ne f_i(v_r)] \\ & \ge {1 \over 2p}\sum_{r=1}^p \mathbb{1} [h(v_r) \ne f_i(v_r)]. \end{align*}

Hence,

\begin{align*} {1 \over T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) & \ge {1 \over T}\sum_{i=1}^T {1 \over 2p}\sum_{r=1}^p \mathbb{1} [A(S_j^i)(v_r) \ne f_i(v_r)] \\ & = {1 \over 2p}\sum_{r=1}^p {1 \over T}\sum_{i=1}^T \mathbb{1} [A(S_j^i)(v_r) \ne f_i(v_r)] \\ & \ge {1 \over 2} \min_{r \in [p]} {1 \over T}\sum_{i=1}^T \mathbb{1} [A(S_j^i)(v_r) \ne f_i(v_r)]. \end{align*}

Next, fix some $r \in [p]$. We can partition $f_1, \dots, f_T$ into $T/2$ disjoint pairs, where a pair $(f_i, f_{i'})$ agrees on every element of $C$ except $v_r$. Since $v_r$ does not appear in $S_j$, each such pair produces the same labelled dataset, $S_j^i = S_j^{i'}$, and hence $A(S_j^i) = A(S_j^{i'})$. Because $f_i(v_r) \ne f_{i'}(v_r)$, exactly one of the two indicators below equals $1$:

\begin{gather*} \mathbb{1} [A(S_j^i)(v_r) \ne f_i(v_r)] + \mathbb{1} [A(S_j^{i'})(v_r) \ne f_{i'}(v_r)] = 1, \\ \text{and therefore} \quad {1\over T}\sum_{i=1}^T \mathbb{1} [A(S_j^i)(v_r) \ne f_i(v_r)] = {1\over 2}. \end{gather*}
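To make the pairing concrete in the tiny $m = 1$ illustration above (my own example): with $C = \{x_1, x_2\}$ and $v_r = x_2$, the $T/2 = 2$ pairs are $(f_1, f_2)$ and $(f_3, f_4)$; each pair agrees on $x_1$ and differs only on $x_2$. Any $S_j$ that avoids $x_2$ consists only of copies of $x_1$, so $S_j^1 = S_j^2$, and the single hypothesis $A(S_j^1)$ disagrees with exactly one of $f_1(x_2) = 0$ and $f_2(x_2) = 1$.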

Since this holds for every $r \in [p]$, combining it with the previous bound gives, for every $j \in [k]$,

\begin{gather*} {1\over T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) \ge {1\over 2}\cdot{1\over T}\cdot{T\over 2} = {1\over 4}. \end{gather*}

In particular the bound holds for the minimizing $j$, so the earlier chain of inequalities yields $\max_{i \in [T]} \mathbb{E}_{S \sim D_i^m}[L_{D_i}(A(S))] \ge 1/4$, which completes the proof.

Corollary

Let $X$ be an infinite domain set (e.g., $\mathbb{R}^d$) and let $H$ be the set of all functions from $X$ to $\{0, 1\}$. Then $H$ is not PAC-learnable.

Proof

Assume, by way of contradiction, that the class is PAC-learnable. Choose some $\epsilon < 1/8$ and $\delta < 1/7$. By the definition of PAC learnability, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$ such that, for any data-generating distribution $D$ over $X \times \{0, 1\}$, if there exists some function $f: X \to \{0, 1\}$ with $L_D(f) = 0$, then with probability greater than $1 - \delta$, when $A$ is applied to a training set $S$ of size $m$ generated i.i.d. from $D$, we have $L_D(A(S)) \le \epsilon$. However, applying the No-Free-Lunch theorem, since $|X| > 2m$, for every learning algorithm (and in particular for $A$) there exists a distribution $D$ such that with probability greater than $1/7 > \delta$ we have $L_D(A(S)) \ge 1/8 > \epsilon$, which yields the desired contradiction.